17 research outputs found

    SpreadCluster: Recovering Versioned Spreadsheets through Similarity-Based Clustering

    Version information plays an important role in understanding, maintaining, and improving the quality of spreadsheets. However, end users rarely use version control tools to document spreadsheet version information. As a result, this information is missing, and different versions of a spreadsheet coexist as individual, similar spreadsheets. Existing approaches try to recover spreadsheet version information by clustering these similar spreadsheets based on spreadsheet filenames or related email conversations. However, the applicability and accuracy of these approaches are limited because the necessary information (e.g., filenames and email conversations) is usually missing. We inspected the versioned spreadsheets in VEnron, which was extracted from the Enron Corporation. In VEnron, the different versions of a spreadsheet are clustered into an evolution group. We observed that the versioned spreadsheets in each evolution group exhibit certain common features (e.g., similar table headers and worksheet names). Based on this observation, we propose an automatic clustering algorithm, SpreadCluster. SpreadCluster learns the feature criteria from the versioned spreadsheets in VEnron, and then automatically clusters spreadsheets with similar features into the same evolution group. We applied SpreadCluster to all spreadsheets in the Enron corpus. The evaluation shows that SpreadCluster clusters spreadsheets with higher precision and recall than the filename-based approach used by VEnron. Based on SpreadCluster's clustering result, we created a new versioned spreadsheet corpus, VEnron2, which is much bigger than VEnron. We also applied SpreadCluster to two other spreadsheet corpora, FUSE and EUSES. The results show that SpreadCluster can cluster the versioned spreadsheets in these two corpora with high precision. Comment: 12 pages, MSR 201
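The idea of grouping spreadsheets by shared features can be illustrated with a minimal sketch. This is not the SpreadCluster algorithm itself: the Jaccard measure, the equal weighting of headers and worksheet names, the fixed `threshold`, and the names `jaccard`, `similarity`, and `cluster` are all assumptions made for illustration, whereas the abstract states that SpreadCluster learns its criteria from VEnron.

```python
from itertools import combinations


def jaccard(a, b):
    """Jaccard similarity of two sets; two empty sets count as identical."""
    a, b = set(a), set(b)
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def similarity(s1, s2):
    # Illustrative weighting (an assumption): average the overlap of
    # table headers and the overlap of worksheet names.
    return 0.5 * jaccard(s1["headers"], s2["headers"]) + \
           0.5 * jaccard(s1["sheets"], s2["sheets"])


def cluster(spreadsheets, threshold=0.6):
    """Single-link grouping: spreadsheets joined by any pair whose
    similarity reaches the (hypothetical, hand-picked) threshold."""
    parent = list(range(len(spreadsheets)))

    def find(i):
        # Union-find lookup with path halving.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    for i, j in combinations(range(len(spreadsheets)), 2):
        if similarity(spreadsheets[i], spreadsheets[j]) >= threshold:
            parent[find(i)] = find(j)

    groups = {}
    for i, s in enumerate(spreadsheets):
        groups.setdefault(find(i), []).append(s["name"])
    return sorted(sorted(g) for g in groups.values())


# Hypothetical input: file names, headers, and sheet names are made up.
docs = [
    {"name": "budget_v1.xls", "headers": {"Date", "Amount", "Desk"},
     "sheets": {"Summary", "Detail"}},
    {"name": "budget_final.xls", "headers": {"Date", "Amount", "Desk", "Trader"},
     "sheets": {"Summary", "Detail"}},
    {"name": "roster.xls", "headers": {"Name", "Phone"}, "sheets": {"Sheet1"}},
]
evolution_groups = cluster(docs)
```

With this input, the two budget files share most headers and all worksheet names, so they fall into one group, while `roster.xls` stays on its own. Note that the grouping never looks at the filenames.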

    VEnron1.0

    VEnron1.0 is an industrial-scale, public spreadsheet corpus with version information, comprising 360 evolution groups and 7,294 spreadsheets. (Multiple versions originating from the same spreadsheet are considered an evolution group.) VEnron1.0 was published with our ICSE SEIP 2016 paper.
    Wensheng Dou, Liang Xu, Shing-Chi Cheung, Chushu Gao, Jun Wei, Tao Huang. VEnron: A Versioned Spreadsheet Corpus and Related Evolution Analysis. In Proceedings of the 38th International Conference on Software Engineering (ICSE SEIP 2016), pages 162-171, Austin, TX, USA, May 2016.

    VEnron2

    VEnron2 is an industrial-scale, public spreadsheet corpus with version information, comprising 1,609 evolution groups and 12,254 spreadsheets. (Multiple versions originating from the same spreadsheet are considered an evolution group.) VEnron2 is a major improvement over VEnron1.1: using SpreadCluster, we extracted many more evolution groups and spreadsheets from the Enron email archive. VEnron2 was published in May 2017 in association with our MSR 2017 paper.
    Liang Xu, Wensheng Dou, Chushu Gao, Jie Wang, Jun Wei, Hua Zhong, Tao Huang. SpreadCluster: Recovering Versioned Spreadsheets through Similarity-Based Clustering. In Proceedings of the 14th International Conference on Mining Software Repositories (MSR 2017), May 2017.

    VEUSES

    EUSES is the most frequently used spreadsheet corpus and contains 4,037 spreadsheets, which were extracted from the World Wide Web. We applied SpreadCluster to EUSES and manually validated all groups. Based on the validated result, we built the VEUSES corpus, containing 177 evolution groups and 363 spreadsheets. VEUSES was published in May 2017 in association with our MSR 2017 paper.
    Liang Xu, Wensheng Dou, Chushu Gao, Jie Wang, Jun Wei, Hua Zhong, Tao Huang. SpreadCluster: Recovering Versioned Spreadsheets through Similarity-Based Clustering. In Proceedings of the 14th International Conference on Mining Software Repositories (MSR 2017), May 2017.

    VEnron1.1

    VEnron1.1 is an industrial-scale, public spreadsheet corpus with version information, comprising 322 evolution groups and 7,171 spreadsheets. (Multiple versions originating from the same spreadsheet are considered an evolution group.) VEnron1.1 is an improvement over VEnron1.0: we fixed some errors in VEnron1.0 and designed a simple layout structure for storing versioned spreadsheets. VEnron1.1 was published in May 2017 in association with our MSR 2017 paper.
    Liang Xu, Wensheng Dou, Chushu Gao, Jie Wang, Jun Wei, Hua Zhong, Tao Huang. SpreadCluster: Recovering Versioned Spreadsheets through Similarity-Based Clustering. In Proceedings of the 14th International Conference on Mining Software Repositories (MSR 2017), May 2017.

    VFUSE

    FUSE is a reproducible, internet-scale corpus containing 249,376 unique spreadsheets that were extracted from over 26.83 billion pages. We applied SpreadCluster to FUSE and manually validated 200 groups randomly selected from the clustering result. Based on the validated result, we built the VFUSE corpus, containing 188 evolution groups and 1,143 spreadsheets. VFUSE was published in May 2017 in association with our MSR 2017 paper.
    Liang Xu, Wensheng Dou, Chushu Gao, Jie Wang, Jun Wei, Hua Zhong, Tao Huang. SpreadCluster: Recovering Versioned Spreadsheets through Similarity-Based Clustering. In Proceedings of the 14th International Conference on Mining Software Repositories (MSR 2017), May 2017.

    CACheck: Detecting and Repairing Cell Arrays in Spreadsheets


    Mining Vehicles Frequently Appearing Together from Massive Passing Records

    Vehicles Frequently Appearing Together, or VFATs, can be clues in solving criminal cases. Traditional sequence mining approaches help identify VFATs from passing-through records collected at monitoring sites. However, huge traffic data streams hinder fast identification of VFATs. In this paper, we present a multi-threaded approach to fast identification of VFATs on multi-core processors, called Frequent Sequential Mining based on Multi-Cores (FSMMC). It parallelizes the execution of tasks, partitions large volumes of data, and obtains VFATs by merging local candidates discovered in different threads running on different processor cores. Through local parallel reduction, FSMMC eliminates repetitive patterns and reduces computational effort. Moreover, it balances the workload by dynamically distributing tasks to a pool of threads, in which a thread that finishes first joins another running thread. Both theoretical analysis and case studies show that FSMMC takes full advantage of multi-core computing platforms and achieves higher speed-up when searching for VFATs among massive passing-through records, compared with other approaches without multithreading.
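The partition, mine-locally, then merge structure described above can be sketched as follows. This is a simplified illustration and not FSMMC: it counts only vehicle pairs rather than full sequences, it partitions by monitoring site, and the record layout `(site, timestamp, plate)`, the `window` and `min_support` parameters, and the function names are all assumptions. Python threads are used only to show the structure; because of the interpreter's global lock they will not deliver the multi-core speed-up the abstract reports.

```python
from collections import Counter, defaultdict
from concurrent.futures import ThreadPoolExecutor


def local_candidates(site_records, window):
    """Find vehicle pairs that pass one site within `window` seconds
    of each other. Each pair is counted at most once per site."""
    recs = sorted(site_records)  # list of (timestamp, plate)
    seen = set()
    for i, (t1, p1) in enumerate(recs):
        for t2, p2 in recs[i + 1:]:
            if t2 - t1 > window:
                break  # records are time-sorted, so later ones are too far
            if p1 != p2:
                seen.add(tuple(sorted((p1, p2))))
    return Counter(seen)


def find_vfats(records, window=60, min_support=2, workers=4):
    """Partition records by site, mine each partition in a worker
    thread, then merge the local counts and keep frequent pairs."""
    by_site = defaultdict(list)
    for site, ts, plate in records:
        by_site[site].append((ts, plate))

    merged = Counter()
    # The pool hands out partitions to whichever worker is free.
    with ThreadPoolExecutor(max_workers=workers) as pool:
        tasks = pool.map(lambda r: local_candidates(r, window),
                         by_site.values())
        for local in tasks:
            merged.update(local)  # merge step: sum the local counts

    return {pair: n for pair, n in merged.items() if n >= min_support}


# Hypothetical passing records: (site, timestamp in seconds, plate).
records = [
    ("S1", 0, "A"), ("S1", 10, "B"), ("S1", 500, "C"),
    ("S2", 100, "B"), ("S2", 130, "A"), ("S2", 140, "C"),
    ("S3", 0, "C"), ("S3", 5, "D"),
]
vfats = find_vfats(records)
```

In this made-up data, vehicles `A` and `B` pass sites `S1` and `S2` close together, so the pair reaches the support threshold of 2; every other pair co-occurs at only one site and is dropped during the merge.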